{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:\n",
    "```\n",
    "cd tokenization2arrow\n",
    "make venv \n",
    "source venv/bin/activate \n",
    "pip install jupyterlab\n",
    "pip install -U ipywidgets\n",
    "./venv/bin/jupyter lab\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
   "metadata": {},
   "source": [
    "##### **** Configure the transform parameters. The dictionary keys that configure the tokenization transform are as follows:\n",
    "| Name | Description |\n",
    "|------|-------------|\n",
    "| tkn_tokenizer | Tokenizer used for tokenization. It can also be a path to a pre-trained tokenizer. By default, `hf-internal-testing/llama-tokenizer` from HuggingFace is used. |\n",
    "| tkn_tokenizer_args | Arguments for the tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be the arguments for the tokenizer `bigcode/starcoder` from HuggingFace. |\n",
    "| tkn_doc_id_column | Column containing the document ID, whose values should be unique across the dataset. |\n",
    "| tkn_doc_content_column | Column containing the document content. |\n",
    "| tkn_text_lang | Language of the text content, used for better text splitting if needed. |\n",
    "| tkn_chunk_size | Specify a value >0 to tokenize each row/doc in chunks of characters (rounded to whole words). |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9a2fff6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove output folder\n",
    "!rm -rf output01"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6fafcb20-aa91-4b8b-90db-d8c8fcb4cc02",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the Hugging Face token used to download the Llama tokenizer.\n",
    "# 'hf_XXX' is a placeholder; replace it with your own token.\n",
    "import os\n",
    "os.environ['HF_TOKEN'] = 'hf_XXX'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9669273a-8fcc-4b40-9b20-8df658e2ab58",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_tokenization2arrow.ray.runtime import Tokenization2Arrow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
   "metadata": {},
   "source": [
    "##### **** Set up the runtime parameters for this transform and run it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "badafb96-64d2-4bb8-9f3e-b23713fd5c3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tokenize the files in the input folder and write the results to output01.\n",
    "Tokenization2Arrow(\n",
    "    input_folder=\"test-data/ds01/input\",\n",
    "    output_folder=\"output01\",\n",
    "    tkn_tokenizer=\"hf-internal-testing/llama-tokenizer\",\n",
    "    run_locally=True,\n",
    "    tkn_chunk_size=20_000,\n",
    ").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "### Explore output folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9a594810-f07f-4920-a1c9-54844b384351",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[01;34moutput01\u001b[0m\n",
      "├── \u001b[01;34mlang=en\u001b[0m\n",
      "│   ├── \u001b[01;34mdataset=cybersecurity_v2.0\u001b[0m\n",
      "│   │   └── \u001b[01;34mversion=2.3.2\u001b[0m\n",
      "│   │       └── \u001b[00mpq03.snappy.arrow\u001b[0m\n",
      "│   ├── \u001b[00mpq01.arrow\u001b[0m\n",
      "│   └── \u001b[00mpq02.arrow\u001b[0m\n",
      "├── \u001b[01;34mmeta\u001b[0m\n",
      "│   └── \u001b[01;34mlang=en\u001b[0m\n",
      "│       ├── \u001b[01;34mdataset=cybersecurity_v2.0\u001b[0m\n",
      "│       │   └── \u001b[01;34mversion=2.3.2\u001b[0m\n",
      "│       │       ├── \u001b[00mpq03.snappy.docs\u001b[0m\n",
      "│       │       └── \u001b[00mpq03.snappy.docs.ids\u001b[0m\n",
      "│       ├── \u001b[00mpq01.docs\u001b[0m\n",
      "│       ├── \u001b[00mpq01.docs.ids\u001b[0m\n",
      "│       ├── \u001b[00mpq02.docs\u001b[0m\n",
      "│       └── \u001b[00mpq02.docs.ids\u001b[0m\n",
      "└── \u001b[00mmetadata.json\u001b[0m\n",
      "\n",
      "8 directories, 10 files\n"
     ]
    }
   ],
   "source": [
    "!tree output01"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac1088b5-f102-4718-9286-f4236e55c1b2",
   "metadata": {},
   "source": [
    "### Check metadata.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat output01/metadata.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8efbf118-7480-4b5e-bf48-9022c5a1605d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
